Dev/spark backend new comparator by TonyKatkov89 · Pull Request #213 · sb-ai-lab/HypEx

TonyKatkov89 · 2026-03-13T13:49:43Z

Stats-based comparator for Spark-efficient hypothesis testing

Summary

Refactor comparator hierarchy: extract BaseComparator as a shared root, keeping GroupsComparator (former Comparator) for raw-data comparisons and adding the new StatsComparator for aggregation-based comparisons.
Add StatsComparator: a two-phase abstract comparator that operates on pre-aggregated sufficient statistics instead of raw data slices. Phase 1 issues a single .agg() call across all target columns and groups; Phase 2 runs analytical tests on the returned scalar dicts, entirely on the driver. This reduces the number of Spark jobs from O(columns × groups) to a constant: one per executor.
Add AggTTest: a concrete StatsComparator implementing Welch's t-test from {mean, var, count} statistics. Produces the same output shape as TTest and is a drop-in replacement in pipelines where raw data transfer is expensive.
Fix GroupedDataset.agg with list input: flatten the Pandas MultiIndex produced by list-style aggregation into {col}┆{stat} column names, which StatsComparator.execute relies on.
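The flattening described in the last item can be sketched as follows. This is a minimal illustration, not the code from `groupby_dataset.py`: `SEP` and `flatten_columns` are hypothetical names, and the sketch works on plain tuples (the form in which pandas exposes MultiIndex column labels) so that it runs without pandas installed.

```python
# Hypothetical constant: the PR joins column and statistic names with "┆".
SEP = "┆"


def flatten_columns(columns):
    """Turn (column, stat) tuple labels into flat 'column┆stat' strings.

    Labels that are already plain (non-tuple) are passed through unchanged.
    """
    flat = []
    for label in columns:
        if isinstance(label, tuple):
            flat.append(SEP.join(str(part) for part in label))
        else:
            flat.append(str(label))
    return flat


# Labels as they would appear after a list-style aggregation on one column.
multi = [("revenue", "mean"), ("revenue", "var"), ("revenue", "count")]
print(flatten_columns(multi))
# → ['revenue┆mean', 'revenue┆var', 'revenue┆count']
```

With flat names like these, a consumer can look up a statistic by building the key from the column and statistic names instead of navigating a two-level index.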

Motivation

The existing Comparator/GroupsComparator pattern pulls raw group data to the driver before running statistical tests. On a Spark backend this causes one separate distributed job per (group pair × column), which is prohibitively slow for wide datasets. StatsComparator + AggTTest solve this by aggregating in one distributed pass and doing the math locally on small scalar dicts.
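The "math on small scalar dicts" step can be illustrated with Welch's t-test computed from `{mean, var, count}` only. This is a sketch under stated assumptions rather than the `AggTTest` implementation: the function name `welch_from_stats` and the example numbers are hypothetical, and `var` is assumed to be the sample variance (ddof=1).

```python
import math


def welch_from_stats(a, b):
    """Return Welch's t statistic and degrees of freedom.

    `a` and `b` are dicts with keys "mean", "var", "count";
    "var" is assumed to be the unbiased sample variance.
    """
    # Squared standard error of each group mean.
    se_a = a["var"] / a["count"]
    se_b = b["var"] / b["count"]

    t_stat = (a["mean"] - b["mean"]) / math.sqrt(se_a + se_b)

    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (a["count"] - 1) + se_b ** 2 / (b["count"] - 1)
    )
    return t_stat, dof


# Hypothetical aggregated statistics for two groups of one target column.
control = {"mean": 10.0, "var": 4.0, "count": 100}
treatment = {"mean": 11.0, "var": 9.0, "count": 100}

t_stat, dof = welch_from_stats(control, treatment)
print(round(t_stat, 4), round(dof, 2))
```

The two-sided p-value then follows from the Student t distribution with `dof` degrees of freedom (for example via `scipy.stats.t.sf`), which is omitted here to keep the sketch dependency-free. No raw rows are needed at any point, which is why the computation can stay on the driver.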

Files changed

| File | Change |
| --- | --- |
| `hypex/comparators/abstract.py` | New `BaseComparator`, `GroupsComparator` (renamed from `Comparator`), `StatsComparator` |
| `hypex/comparators/stats_hypothesis_testing.py` | New `AggTTest` |
| `hypex/comparators/__init__.py` | Export new public classes |
| `hypex/dataset/groupby_dataset.py` | Fix MultiIndex flattening in `agg()` |
| `hypex/utils/constants.py` / `__init__.py` | Remove duplicate constant definitions |

Test plan

Run existing comparator tests to verify GroupsComparator (formerly Comparator) behaviour is unchanged
Verify AggTTest p-values match TTest on the same dataset
Test GroupedDataset.agg with a list of stat functions produces flat col┆stat column names
Smoke-test StatsComparator/AggTTest against a Spark-backed ExperimentData

- Fix __init__ to conditionally initialize physical index based on flag
- Implement iloc with physical_index_actual_flag checking
- Add physical_index_actual_flag=False to loc, sort_values, dropna
- Exclude utility columns from fillna, drop, rename operations
- Use _public_columns in agg, mode, log to avoid utility column processing
- Add warnings when user attempts to modify utility columns

BREAKING CHANGE: iloc now requires physical_index_actual_flag to be True

# Conflicts:
# hypex/dataset/abstract.py
# hypex/dataset/backends/pandas_backend.py
# hypex/dataset/backends/spark_backend.py
# hypex/dataset/dataset.py
# hypex/utils/__init__.py
# hypex/utils/typings.py
# tests/test_spark_backend.ipynb

yurashku and others added 30 commits February 11, 2026 17:46

*

fb5c3ea

added dependencies

e56badf

added small dataset

cf335d9

fix

8c29618

fixed from_dict method

7fe72f8

fixed to_small_dataset

93c1e43

fixed set_value

first sekizo

bc553fb

second sekizo

861b40b

docs(pandas_backend.py): add docstring for all methods in PandasNavig…

784633e

…ation and PandasDataset

fix(pandas_backend.py): fix bug in get_numeric_columns; removing dupl…

db0941b

…icate code get_values and iget_values from PandasDataset

feat(spark_backend.py): add index without implementation not implemen…

ad5a97e

…ted methods

feat/refactor(spark_backend.py): add physical index with implementati…

65a1dd6

…on some methods

feat(spark_backend): implementation __getitem__, get_values, iget_val…

e9cefd0

…ues, from_dict, to_dict, to_records, index. Testing and limitation required.

feat(spark_backend): implement add_column, get, take and append.

f16e21b

Testing and limitation required.

StatsComparator added

f6c7172

StatsComparator added

76ed478

feat(spark_backend): rewrite spark backend on pandas api on spark.

e0d81d3

feat/fix(abstract/typings): Adaptation to spark backend

c410645

fix(spark_backend): massive fix

b6e183e

fix(abstract/spark_backend): fix groupby problems.

5747702

fix(abstract/spark_backend): massive fix.

f99f6b4

test(spark_backend): add tests notebook spark backend.

d46674d

artifact(spark_backend): add artifact spark backend native.

d78cc13

chore(pyproject.toml): add dep

fcd87e5

fix(hypex): del one file

511d28d

fix(pandas_backend): fix groupby pandas.

f7038f1

refactor(abstract/spark_backend): fix typehints.

ec45a51

refactor(spark_dataset): fix to_pandas call.

31cd179

fix(spark_backend): fix after pr

504158a

TonyKatkov89 added 5 commits March 10, 2026 17:05

Merge remote-tracking branch 'origin/dev/spark_backend_pandasapi_on_s…

9cc2f14

…park' into dev/spark_backend_new_comparator # Conflicts: # hypex/dataset/backends/pandas_backend.py # hypex/dataset/backends/spark_backend.py

stats comparator optimization

0ef1877

stats comparator optimization

db6ef8f

groupby_dataset.py added

3932f9d

TonyKatkov89 added this to the 1.1 milestone Mar 13, 2026

TonyKatkov89 requested a review from Mkrie March 13, 2026 13:49

TonyKatkov89 self-assigned this Mar 13, 2026

TonyKatkov89 added the enhancement New feature or request label Mar 13, 2026

name border symbol import added

1af984c

TonyKatkov89 wants to merge 36 commits into dev/spark_backend from dev/spark_backend_new_comparator
